NSF PAR Search | NSF Public Access Repository

IcedTea: Efficient and Responsive Time-Travel Debugging in Dataflow Systems

https://doi.org/10.14778/3712221.3712251

Ni, Shengquan; Huang, Yicong; Wang, Zuozhi; Li, Chen (September 2025, Proceedings of the VLDB Endowment)

Dataflow systems have an increasing need to support a wide range of tasks in data-centric applications using the latest techniques such as machine learning. These tasks often involve custom functions with complex internal states. Consequently, users need enhanced debugging support to understand runtime behaviors and investigate internal states of dataflows. Traditional forward debuggers allow users to follow the chronological order of operations in an execution. Therefore, a user cannot easily identify a past runtime behavior after an unexpected result is produced. In this paper, we present a novel time-travel debugging paradigm called IcedTea, which supports reverse debugging. In particular, in a dataflow's execution, which is inherently distributed across multiple operators, the user can periodically interact with the job and retrieve the global states of the operators. After the execution, the system allows the user to roll back the dataflow state to any past interaction. The user can use step instructions to repeat the past execution to understand how data was processed in the original execution. We give a full specification of this powerful paradigm, study how to reduce its runtime overhead, and develop techniques to support debugging instructions responsively. Our experiments on real-world datasets and workflows show that IcedTea can support responsive time-travel debugging with low time and space overhead.
Free, publicly-accessible full text available September 1, 2026
Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

Liu, Xiaozhen; Huang, Yicong; Lin, Xinyuan; Kumar, Avinash; Alsudais, Sadeem; Li, Chen (June 2025, ACM SIGMOD)

Free, publicly-accessible full text available June 22, 2026
Demonstration of Udon: Line-by-line Debugging of User-Defined Functions in Data Workflows

https://doi.org/10.1145/3626246.3654756

Huang, Yicong; Wang, Zuozhi; Li, Chen (June 2024, ACM)

Full Text Available
Brain image data processing using collaborative data workflows on Texera

https://doi.org/10.3389/fncir.2024.1398884

Ding, Yunyan; Huang, Yicong; Gao, Pan; Thai, Andy; Chilaparasetti, Atchuth Naveen; Gopi, M; Xu, Xiangmin; Li, Chen (July 2024, Frontiers in Neural Circuits)

In the realm of neuroscience, mapping the three-dimensional (3D) neural circuitry and architecture of the brain is important for advancing our understanding of neural circuit organization and function. This study presents a novel pipeline that transforms mouse brain samples into detailed 3D brain models using a collaborative data analytics platform called “Texera.” The user-friendly Texera platform allows for effective interdisciplinary collaboration between team members in neuroscience, computer vision, and data processing. Our pipeline utilizes the tile images from a serial two-photon tomography/TissueCyte system, then stitches tile images into brain section images, and constructs 3D whole-brain image datasets. The resulting 3D data supports downstream analyses, including 3D whole-brain registration, atlas-based segmentation, cell counting, and high-resolution volumetric visualization. Using this platform, we implemented specialized optimization methods and obtained significant performance enhancement in workflow operations. We expect the neuroscience community can adopt our approach for large-scale image-based data processing and analysis.
Full Text Available
Data Science Tasks Implemented with Scripts versus GUI-Based Workflows: The Good, the Bad, and the Ugly

https://doi.org/10.1109/icdew61823.2024.00040

Taylor, Alexander K; Huang, Yicong; Hao, Junheng; Lin, Xinyuan; Chen, Xiusi; Wang, Wei; Li, Chen (May 2024, IEEE)

Full Text Available
Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control

https://doi.org/10.1145/3626712

Huang, Yicong; Wang, Zuozhi; Li, Chen (December 2023, Proceedings of the ACM on Management of Data)

Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger to support fine-grained debugging of UDFs. Udon encapsulates the modern line-by-line debugging primitives, such as the ability to set breakpoints, perform code inspections, and make code modifications while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span across multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show its high efficiency and scalability.
Full Text Available
Wording Matters: the Effect of Linguistic Characteristics and Political Ideology on Resharing of COVID-19 Vaccine Tweets

https://doi.org/10.1145/3637876

Borghouts, Judith; Huang, Yicong; Hopfer, Suellen; Li, Chen; Mark, Gloria (January 2024, ACM Transactions on Computer-Human Interaction)

Social media platforms are frequently used to share information and opinions around vaccinations. The more often a message is reshared, the wider the reach of the message and potential influence it may have on shaping people’s opinions to get vaccinated or not. We used a negative binomial regression to investigate whether a message’s linguistic characteristics (degree of concreteness, emotional arousal, and sentiment) and user characteristics (political ideology and number of followers) may influence users’ decisions to reshare tweets related to the COVID-19 vaccine. We analyzed US English-language tweets related to the COVID-19 vaccine between May 2020 and October 2021 (N = 236,054). Tweets with positive and high-arousal words were more often retweeted than negative, low-arousal tweets. Tweets with abstract words were more often retweeted than tweets with concrete words. In addition, while Liberal users were more likely to have tweets with a positive sentiment reshared, Conservative users were more likely to have tweets with a negative sentiment reshared. Our results can inform public health messaging on how to best phrase vaccine information to impact engagement and information resharing, and potentially persuade a wider set of people to get vaccinated.
Full Text Available
Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

https://doi.org/10.1145/3698832

Liu, Xiaozhen; Huang, Yicong; Lin, Xinyuan; Kumar, Avinash; Alsudais, Sadeem; Li, Chen (December 2024, Proceedings of the ACM on Management of Data)

Data analytics tasks are often formulated as data workflows represented as directed acyclic graphs (DAGs) of operators. The recent trend of adopting machine learning (ML) techniques in workflows results in increasingly complicated DAGs with many operators and edges. Compared to the operator-at-a-time execution paradigm, pipelined execution has benefits of reducing the materialization cost of intermediate results and allowing operators to produce results early, which are critical in iterative analysis on large data volumes. Correctly scheduling a workflow DAG for pipelined execution is non-trivial due to the richer semantics of operators and the increasing complexity of DAGs. Several existing data systems adopt simple heuristics to solve the problem without considering costs such as materialization sizes. In this paper, we systematically study the problem of scheduling a workflow DAG for pipelined execution, and develop a novel cost-based optimizer called Pasta for generating a high-quality schedule. The Pasta optimizer is not only general and applicable to a wide variety of cost functions, but also capable of utilizing properties inherent in a broad class of cost functions to improve its performance significantly. We conducted a thorough evaluation of developed techniques on real-world workflows and show the efficiency and efficacy of these solutions.
Full Text Available
The Marketing and Perceptions of Non-Tobacco Blunt Wraps on Twitter

https://doi.org/10.1080/10826084.2023.2280572

Rhee, Joshua U; Huang, Yicong; Soroosh, Aurash J; Alsudais, Sadeem; Ni, Shengquan; Kumar, Avinash; Paredes, Jacob; Li, Chen; Timberlake, David S (November 2023, Substance Use & Misuse)

Full Text Available
How the experience of California wildfires shape Twitter climate change framings

https://doi.org/10.1007/s10584-023-03668-0

Ko, Jessie W Y; Ni, Shengquan; Taylor, Alexander; Chen, Xiusi; Huang, Yicong; Kumar, Avinash; Alsudais, Sadeem; Wang, Zuozhi; Liu, Xiaozhen; Wang, Wei; et al (January 2024, Climatic Change)

Climate communication scientists search for effective message strategies to engage the ambivalent public in support of climate advocacy. The personal experience of wildfire is expected to render climate change impacts more concretely, pointing to a potential message strategy to engage the public. This study examined Twitter discourse related to climate change during the onset of 20 wildfires in California between the years 2017 and 2021. In this mixed method study, we analyzed tweets geographically and temporally proximal to the occurrence of wildfires to discover framings and examined how frequencies in climate framings changed before and after fires. Results identified three predominant climate framings: linking wildfire to climate change, suggesting climate actions, and attributing climate change to adversities besides wildfires. Mean tweet frequencies linking wildfire to climate change and attributing adversities increased significantly after the onset of fire. While suggesting climate action tweets also increased, the increase was not statistically significant. Temporal analysis of tweet frequencies for the three themes of tweets showed that discussion increased after the onset of a fire but persisted typically no more than 2 weeks. For fires that burned for longer periods of more than a month, external events triggered climate discussions. Our findings contribute to identifying how the personal experience of wildfire shapes Twitter discussion related to climate change, and how these framings change over time during wildfire events, leading to insights into critical time points after wildfire for implementing message strategies to increase public engagement on climate change impacts and policy.
Full Text Available
